Predict Bike Sharing Demand with AutoGluon Template¶

Project: Predict Bike Sharing Demand with AutoGluon¶

This notebook is a template with each step that you need to complete for the project.

Please fill in your code where there are explicit ? markers in the notebook. You are welcome to add more cells and code as you see fit.

Once you have completed all the code implementations, please export your notebook as an HTML file so the reviewers can view your code. Make sure all cell outputs are rendered correctly.

File -> Export Notebook As... -> Export Notebook as HTML

There is a writeup to complete as well after all code implementation is done. Please answer all questions and attach the necessary tables and charts. You can complete the writeup in either Markdown or PDF.

Completing the code template and writeup template will cover all of the rubric points for this project.

The rubric contains "Stand Out Suggestions" for enhancing the project beyond the minimum requirements. The stand out suggestions are optional. If you decide to pursue the "stand out suggestions", you can include the code in this notebook and also discuss the results in the writeup file.

Step 1: Create an account with Kaggle¶

Create Kaggle Account and download API key¶

Below is an example of the steps to get the API username and key. Each student will have their own username and key.

  1. Open account settings. kaggle1.png kaggle2.png
  2. Scroll down to API and click Create New API Token. kaggle3.png kaggle4.png
  3. Open up kaggle.json and use the username and key. kaggle5.png
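The kaggle.json file from step 3 is just a two-field JSON object. A minimal sketch of its shape (the values here are placeholders, not real credentials):

```python
import json

# Placeholder credentials -- replace with the username and key from your own kaggle.json
token = {"username": "your-kaggle-username", "key": "your-api-key"}

# The file on disk is exactly this one-line JSON object
print(json.dumps(token))
```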

Step 2: Download the Kaggle dataset using the kaggle Python library¶

Open up SageMaker Studio and use the starter template¶

  1. Notebook should be using a ml.t3.medium instance (2 vCPU + 4 GiB)
  2. Notebook should be using kernel: Python 3 (MXNet 1.8 Python 3.7 CPU Optimized)

Install packages¶

In [ ]:
!pip install pydantic==1.10.2

!pip install -U pip
!pip install -U setuptools wheel
!pip install -U "mxnet<2.0.0" bokeh==2.0.1
!pip install autogluon --no-cache-dir
# Without --no-cache-dir, smaller aws instances may have trouble installing

!pip install -U python-dotenv
!pip install -U kaggle
!pip install -U pandas-profiling
!pip install ipywidgets==7.7.2

Setup Kaggle API Key¶

In [2]:
# create the .kaggle directory and an empty kaggle.json file
!mkdir -p /root/.kaggle
!touch /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
In [3]:
from dotenv import load_dotenv 
from os import environ
load_dotenv()
Out[3]:
True
In [4]:
# Fill in your username and key from the Kaggle account and API token file you created
import json
kaggle_username = environ.get("KAGGLE_USERNAME")
kaggle_key = environ.get("KAGGLE_KEY")

# Save the API token to the kaggle.json file
with open("/root/.kaggle/kaggle.json", "w") as f:
    f.write(json.dumps({"username": kaggle_username, "key": kaggle_key}))

Download and explore dataset¶

Go to the bike sharing demand competition and agree to the terms¶

kaggle6.png

In [5]:
# Download the dataset; it comes as a .zip file, so you'll need to unzip it as well.
#!kaggle competitions download -c bike-sharing-demand
# If you already downloaded it, the -o flag overwrites existing files when unzipping
!unzip -o bike-sharing-demand.zip
Archive:  bike-sharing-demand.zip
  inflating: sampleSubmission.csv    
  inflating: test.csv                
  inflating: train.csv               
In [6]:
import pandas as pd
from autogluon.tabular import TabularPredictor
import bokeh
In [7]:
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
In [8]:
# Create the train dataset in pandas by reading the csv
# Set the parsing of the datetime column so you can use some of the `dt` features in pandas later
train = pd.read_csv("train.csv", parse_dates=["datetime"])
train.head()
Out[8]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1
In [9]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   casual      10886 non-null  int64         
 10  registered  10886 non-null  int64         
 11  count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB
In [10]:
# Simple output of the train dataset to view the min/max/variation of the dataset features.
train.describe()
Out[10]:
season holiday workingday weather temp atemp humidity windspeed casual registered count
count 10886.000000 10886.000000 10886.000000 10886.000000 10886.00000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000
mean 2.506614 0.028569 0.680875 1.418427 20.23086 23.655084 61.886460 12.799395 36.021955 155.552177 191.574132
std 1.116174 0.166599 0.466159 0.633839 7.79159 8.474601 19.245033 8.164537 49.960477 151.039033 181.144454
min 1.000000 0.000000 0.000000 1.000000 0.82000 0.760000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 2.000000 0.000000 0.000000 1.000000 13.94000 16.665000 47.000000 7.001500 4.000000 36.000000 42.000000
50% 3.000000 0.000000 1.000000 1.000000 20.50000 24.240000 62.000000 12.998000 17.000000 118.000000 145.000000
75% 4.000000 0.000000 1.000000 2.000000 26.24000 31.060000 77.000000 16.997900 49.000000 222.000000 284.000000
max 4.000000 1.000000 1.000000 4.000000 41.00000 45.455000 100.000000 56.996900 367.000000 886.000000 977.000000
In [11]:
# Create the test pandas dataframe in pandas by reading the csv, remember to parse the datetime!
test = pd.read_csv("test.csv", parse_dates=["datetime"])
test.head()
Out[11]:
datetime season holiday workingday weather temp atemp humidity windspeed
0 2011-01-20 00:00:00 1 0 1 1 10.66 11.365 56 26.0027
1 2011-01-20 01:00:00 1 0 1 1 10.66 13.635 56 0.0000
2 2011-01-20 02:00:00 1 0 1 1 10.66 13.635 56 0.0000
3 2011-01-20 03:00:00 1 0 1 1 10.66 12.880 56 11.0014
4 2011-01-20 04:00:00 1 0 1 1 10.66 12.880 56 11.0014
In [12]:
# Read the sample submission, same as the train and test datasets
submission = pd.read_csv("sampleSubmission.csv", parse_dates=["datetime"])
submission.head()
Out[12]:
datetime count
0 2011-01-20 00:00:00 0
1 2011-01-20 01:00:00 0
2 2011-01-20 02:00:00 0
3 2011-01-20 03:00:00 0
4 2011-01-20 04:00:00 0

Step 3: Train a model using AutoGluon’s Tabular Prediction¶

Requirements:

  • We are predicting count, so it is the label we are setting.
  • Ignore the casual and registered columns, as they are not present in the test dataset.
  • Use root_mean_squared_error as the evaluation metric.
  • Set a time limit of 10 minutes (600 seconds).
  • Use the best_quality preset to focus on creating the best model.
In [13]:
learner_kwargs = {
    "ignored_columns": ["casual", "registered"]
}

predictor = TabularPredictor(label="count", learner_kwargs=learner_kwargs, problem_type="regression", 
    eval_metric="root_mean_squared_error").fit(train_data=train, time_limit=600, presets="best_quality")
No path specified. Models will be saved in: "AutogluonModels/ag-20230104_020241/"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels/ag-20230104_020241/"
AutoGluon Version:  0.6.1
Python Version:     3.7.10
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Oct 26 20:36:53 UTC 2022
Train Data Rows:    10886
Train Data Columns: 11
Label Column: count
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Dropping user-specified ignored columns: ['casual', 'registered']
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    3070.91 MB
	Train Data (Original)  Memory Usage: 0.78 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 2 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting DatetimeFeatureGenerator...
/usr/local/lib/python3.7/site-packages/autogluon/features/generators/datetime.py:59: FutureWarning: casting datetime64[ns, UTC] values to int64 with .astype(...) is deprecated and will raise in a future version. Use .view(...) instead.
  good_rows = series[~series.isin(bad_rows)].astype(np.int64)
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('datetime', []) : 1 | ['datetime']
		('float', [])    : 3 | ['temp', 'atemp', 'windspeed']
		('int', [])      : 5 | ['season', 'holiday', 'workingday', 'weather', 'humidity']
	Types of features in processed data (raw dtype, special dtypes):
		('float', [])                : 3 | ['temp', 'atemp', 'windspeed']
		('int', [])                  : 3 | ['season', 'weather', 'humidity']
		('int', ['bool'])            : 2 | ['holiday', 'workingday']
		('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
	0.5s = Fit runtime
	9 features in original data used to generate 13 features in processed data.
	Train Data (Processed) Memory Usage: 0.98 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.54s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
	This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
	To change this, specify the eval_metric parameter of Predictor()
AutoGluon will fit 2 stack levels (L1 to L2) ...
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ... Training model for up to 399.54s of the 599.45s of remaining time.
	-101.5462	 = Validation score   (-root_mean_squared_error)
	0.03s	 = Training   runtime
	0.1s	 = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ... Training model for up to 396.26s of the 596.17s of remaining time.
	-84.1251	 = Validation score   (-root_mean_squared_error)
	0.03s	 = Training   runtime
	0.1s	 = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ... Training model for up to 395.9s of the 595.81s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-131.4609	 = Validation score   (-root_mean_squared_error)
	64.13s	 = Training   runtime
	5.95s	 = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 321.02s of the 520.94s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-131.0542	 = Validation score   (-root_mean_squared_error)
	29.44s	 = Training   runtime
	1.29s	 = Validation runtime
Fitting model: RandomForestMSE_BAG_L1 ... Training model for up to 287.3s of the 487.22s of remaining time.
	-116.5443	 = Validation score   (-root_mean_squared_error)
	10.62s	 = Training   runtime
	0.52s	 = Validation runtime
Fitting model: CatBoost_BAG_L1 ... Training model for up to 273.53s of the 473.44s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-130.5034	 = Validation score   (-root_mean_squared_error)
	197.78s	 = Training   runtime
	0.09s	 = Validation runtime
Fitting model: ExtraTreesMSE_BAG_L1 ... Training model for up to 72.0s of the 271.91s of remaining time.
	-124.5881	 = Validation score   (-root_mean_squared_error)
	4.85s	 = Training   runtime
	0.51s	 = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ... Training model for up to 63.99s of the 263.9s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-137.5911	 = Validation score   (-root_mean_squared_error)
	77.11s	 = Training   runtime
	0.41s	 = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 182.66s of remaining time.
	-84.1251	 = Validation score   (-root_mean_squared_error)
	0.49s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting 9 L2 models ...
Fitting model: LightGBMXT_BAG_L2 ... Training model for up to 182.1s of the 182.08s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-60.2855	 = Validation score   (-root_mean_squared_error)
	55.13s	 = Training   runtime
	3.4s	 = Validation runtime
Fitting model: LightGBM_BAG_L2 ... Training model for up to 121.27s of the 121.25s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-55.161	 = Validation score   (-root_mean_squared_error)
	24.69s	 = Training   runtime
	0.22s	 = Validation runtime
Fitting model: RandomForestMSE_BAG_L2 ... Training model for up to 92.49s of the 92.47s of remaining time.
	-53.3704	 = Validation score   (-root_mean_squared_error)
	26.18s	 = Training   runtime
	0.6s	 = Validation runtime
Fitting model: CatBoost_BAG_L2 ... Training model for up to 63.25s of the 63.24s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-55.6524	 = Validation score   (-root_mean_squared_error)
	62.87s	 = Training   runtime
	0.06s	 = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L3 ... Training model for up to 360.0s of the -3.59s of remaining time.
	-53.0732	 = Validation score   (-root_mean_squared_error)
	0.28s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 604.06s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230104_020241/")

Review AutoGluon's training run with ranking of models that did the best.¶

In [14]:
# Get detailed info of the predictor
pred_info = predictor.info()
with open("docs/pred_info.json", "w") as f:
    f.write(json.dumps(pred_info, default=str))
In [15]:
#from bokeh.plotting import figure, show
#from bokeh.io import output_notebook
#output_notebook()
predictor.fit_summary(show_plot=False)
*** Summary of fit() ***
Estimated performance of each model:
                     model   score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0      WeightedEnsemble_L3  -53.073174      13.265574  553.142472                0.000775           0.277415            3       True         14
1   RandomForestMSE_BAG_L2  -53.370416       9.581156  410.168097                0.600952          26.182611            2       True         12
2          LightGBM_BAG_L2  -55.160954       9.202732  408.676177                0.222528          24.690691            2       True         11
3          CatBoost_BAG_L2  -55.652386       9.038394  446.859077                0.058190          62.873591            2       True         13
4        LightGBMXT_BAG_L2  -60.285482      12.383130  439.118164                3.402925          55.132678            2       True         10
5    KNeighborsDist_BAG_L1  -84.125061       0.103782    0.029244                0.103782           0.029244            1       True          2
6      WeightedEnsemble_L2  -84.125061       0.104525    0.522287                0.000743           0.493043            2       True          9
7    KNeighborsUnif_BAG_L1 -101.546199       0.104947    0.030970                0.104947           0.030970            1       True          1
8   RandomForestMSE_BAG_L1 -116.544294       0.521331   10.616067                0.521331          10.616067            1       True          5
9     ExtraTreesMSE_BAG_L1 -124.588053       0.513939    4.845773                0.513939           4.845773            1       True          7
10         CatBoost_BAG_L1 -130.503441       0.092072  197.782267                0.092072         197.782267            1       True          6
11         LightGBM_BAG_L1 -131.054162       1.289819   29.443498                1.289819          29.443498            1       True          4
12       LightGBMXT_BAG_L1 -131.460909       5.948354   64.125232                5.948354          64.125232            1       True          3
13  NeuralNetFastAI_BAG_L1 -137.591119       0.405960   77.112434                0.405960          77.112434            1       True          8
Number of models trained: 14
Types of models trained:
{'StackerEnsembleModel_KNN', 'StackerEnsembleModel_LGB', 'StackerEnsembleModel_XT', 'StackerEnsembleModel_NNFastAiTabular', 'WeightedEnsembleModel', 'StackerEnsembleModel_CatBoost', 'StackerEnsembleModel_RF'}
Bagging used: True  (with 8 folds)
Multi-layer stack-ensembling used: True  (with 3 levels)
Feature Metadata (Processed):
(raw dtype, special dtypes):
('float', [])                : 3 | ['temp', 'atemp', 'windspeed']
('int', [])                  : 3 | ['season', 'weather', 'humidity']
('int', ['bool'])            : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
Plot summary of models saved to file: AutogluonModels/ag-20230104_020241/SummaryOfModels.html
*** End of fit() summary ***
Out[15]:
{'model_types': {'KNeighborsUnif_BAG_L1': 'StackerEnsembleModel_KNN',
  'KNeighborsDist_BAG_L1': 'StackerEnsembleModel_KNN',
  'LightGBMXT_BAG_L1': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
  'RandomForestMSE_BAG_L1': 'StackerEnsembleModel_RF',
  'CatBoost_BAG_L1': 'StackerEnsembleModel_CatBoost',
  'ExtraTreesMSE_BAG_L1': 'StackerEnsembleModel_XT',
  'NeuralNetFastAI_BAG_L1': 'StackerEnsembleModel_NNFastAiTabular',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel',
  'LightGBMXT_BAG_L2': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L2': 'StackerEnsembleModel_LGB',
  'RandomForestMSE_BAG_L2': 'StackerEnsembleModel_RF',
  'CatBoost_BAG_L2': 'StackerEnsembleModel_CatBoost',
  'WeightedEnsemble_L3': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighborsUnif_BAG_L1': -101.54619908446061,
  'KNeighborsDist_BAG_L1': -84.12506123181602,
  'LightGBMXT_BAG_L1': -131.46090891834504,
  'LightGBM_BAG_L1': -131.054161598899,
  'RandomForestMSE_BAG_L1': -116.54429428704391,
  'CatBoost_BAG_L1': -130.50344119744508,
  'ExtraTreesMSE_BAG_L1': -124.58805258915959,
  'NeuralNetFastAI_BAG_L1': -137.59111927600816,
  'WeightedEnsemble_L2': -84.12506123181602,
  'LightGBMXT_BAG_L2': -60.285481674376115,
  'LightGBM_BAG_L2': -55.160953725178764,
  'RandomForestMSE_BAG_L2': -53.37041620071757,
  'CatBoost_BAG_L2': -55.65238600039221,
  'WeightedEnsemble_L3': -53.0731743886261},
 'model_best': 'WeightedEnsemble_L3',
 'model_paths': {'KNeighborsUnif_BAG_L1': 'AutogluonModels/ag-20230104_020241/models/KNeighborsUnif_BAG_L1/',
  'KNeighborsDist_BAG_L1': 'AutogluonModels/ag-20230104_020241/models/KNeighborsDist_BAG_L1/',
  'LightGBMXT_BAG_L1': 'AutogluonModels/ag-20230104_020241/models/LightGBMXT_BAG_L1/',
  'LightGBM_BAG_L1': 'AutogluonModels/ag-20230104_020241/models/LightGBM_BAG_L1/',
  'RandomForestMSE_BAG_L1': 'AutogluonModels/ag-20230104_020241/models/RandomForestMSE_BAG_L1/',
  'CatBoost_BAG_L1': 'AutogluonModels/ag-20230104_020241/models/CatBoost_BAG_L1/',
  'ExtraTreesMSE_BAG_L1': 'AutogluonModels/ag-20230104_020241/models/ExtraTreesMSE_BAG_L1/',
  'NeuralNetFastAI_BAG_L1': 'AutogluonModels/ag-20230104_020241/models/NeuralNetFastAI_BAG_L1/',
  'WeightedEnsemble_L2': 'AutogluonModels/ag-20230104_020241/models/WeightedEnsemble_L2/',
  'LightGBMXT_BAG_L2': 'AutogluonModels/ag-20230104_020241/models/LightGBMXT_BAG_L2/',
  'LightGBM_BAG_L2': 'AutogluonModels/ag-20230104_020241/models/LightGBM_BAG_L2/',
  'RandomForestMSE_BAG_L2': 'AutogluonModels/ag-20230104_020241/models/RandomForestMSE_BAG_L2/',
  'CatBoost_BAG_L2': 'AutogluonModels/ag-20230104_020241/models/CatBoost_BAG_L2/',
  'WeightedEnsemble_L3': 'AutogluonModels/ag-20230104_020241/models/WeightedEnsemble_L3/'},
 'model_fit_times': {'KNeighborsUnif_BAG_L1': 0.030969619750976562,
  'KNeighborsDist_BAG_L1': 0.029244422912597656,
  'LightGBMXT_BAG_L1': 64.12523245811462,
  'LightGBM_BAG_L1': 29.443498373031616,
  'RandomForestMSE_BAG_L1': 10.616066694259644,
  'CatBoost_BAG_L1': 197.78226709365845,
  'ExtraTreesMSE_BAG_L1': 4.845773220062256,
  'NeuralNetFastAI_BAG_L1': 77.11243438720703,
  'WeightedEnsemble_L2': 0.4930429458618164,
  'LightGBMXT_BAG_L2': 55.13267803192139,
  'LightGBM_BAG_L2': 24.690690755844116,
  'RandomForestMSE_BAG_L2': 26.182611227035522,
  'CatBoost_BAG_L2': 62.87359070777893,
  'WeightedEnsemble_L3': 0.27741503715515137},
 'model_pred_times': {'KNeighborsUnif_BAG_L1': 0.10494709014892578,
  'KNeighborsDist_BAG_L1': 0.10378217697143555,
  'LightGBMXT_BAG_L1': 5.948354005813599,
  'LightGBM_BAG_L1': 1.2898194789886475,
  'RandomForestMSE_BAG_L1': 0.5213305950164795,
  'CatBoost_BAG_L1': 0.0920724868774414,
  'ExtraTreesMSE_BAG_L1': 0.5139386653900146,
  'NeuralNetFastAI_BAG_L1': 0.4059596061706543,
  'WeightedEnsemble_L2': 0.0007426738739013672,
  'LightGBMXT_BAG_L2': 3.402925491333008,
  'LightGBM_BAG_L2': 0.22252774238586426,
  'RandomForestMSE_BAG_L2': 0.6009519100189209,
  'CatBoost_BAG_L2': 0.05818963050842285,
  'WeightedEnsemble_L3': 0.0007748603820800781},
 'num_bag_folds': 8,
 'max_stack_level': 3,
 'model_hyperparams': {'KNeighborsUnif_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'KNeighborsDist_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'LightGBMXT_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'RandomForestMSE_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'CatBoost_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'ExtraTreesMSE_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'NeuralNetFastAI_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'WeightedEnsemble_L2': {'use_orig_features': False,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBMXT_BAG_L2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'RandomForestMSE_BAG_L2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'CatBoost_BAG_L2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'WeightedEnsemble_L3': {'use_orig_features': False,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True}},
 'leaderboard':                      model   score_val  pred_time_val    fit_time  \
 0      WeightedEnsemble_L3  -53.073174      13.265574  553.142472   
 1   RandomForestMSE_BAG_L2  -53.370416       9.581156  410.168097   
 2          LightGBM_BAG_L2  -55.160954       9.202732  408.676177   
 3          CatBoost_BAG_L2  -55.652386       9.038394  446.859077   
 4        LightGBMXT_BAG_L2  -60.285482      12.383130  439.118164   
 5    KNeighborsDist_BAG_L1  -84.125061       0.103782    0.029244   
 6      WeightedEnsemble_L2  -84.125061       0.104525    0.522287   
 7    KNeighborsUnif_BAG_L1 -101.546199       0.104947    0.030970   
 8   RandomForestMSE_BAG_L1 -116.544294       0.521331   10.616067   
 9     ExtraTreesMSE_BAG_L1 -124.588053       0.513939    4.845773   
 10         CatBoost_BAG_L1 -130.503441       0.092072  197.782267   
 11         LightGBM_BAG_L1 -131.054162       1.289819   29.443498   
 12       LightGBMXT_BAG_L1 -131.460909       5.948354   64.125232   
 13  NeuralNetFastAI_BAG_L1 -137.591119       0.405960   77.112434   
 
     pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  \
 0                 0.000775           0.277415            3       True   
 1                 0.600952          26.182611            2       True   
 2                 0.222528          24.690691            2       True   
 3                 0.058190          62.873591            2       True   
 4                 3.402925          55.132678            2       True   
 5                 0.103782           0.029244            1       True   
 6                 0.000743           0.493043            2       True   
 7                 0.104947           0.030970            1       True   
 8                 0.521331          10.616067            1       True   
 9                 0.513939           4.845773            1       True   
 10                0.092072         197.782267            1       True   
 11                1.289819          29.443498            1       True   
 12                5.948354          64.125232            1       True   
 13                0.405960          77.112434            1       True   
 
     fit_order  
 0          14  
 1          12  
 2          11  
 3          13  
 4          10  
 5           2  
 6           9  
 7           1  
 8           5  
 9           7  
 10          6  
 11          4  
 12          3  
 13          8  }
In [16]:
predictor.leaderboard(silent=True).plot(kind="bar", x="model", y="score_val")
Out[16]:
<AxesSubplot:xlabel='model'>
In [18]:
# Save validation scores
leaderboard = predictor.leaderboard()
leaderboard["description"] = "baseline with raw features"
leaderboard.to_csv("docs/leaderboard.csv", index=False)
                     model   score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0      WeightedEnsemble_L3  -53.073174      13.265574  553.142472                0.000775           0.277415            3       True         14
1   RandomForestMSE_BAG_L2  -53.370416       9.581156  410.168097                0.600952          26.182611            2       True         12
2          LightGBM_BAG_L2  -55.160954       9.202732  408.676177                0.222528          24.690691            2       True         11
3          CatBoost_BAG_L2  -55.652386       9.038394  446.859077                0.058190          62.873591            2       True         13
4        LightGBMXT_BAG_L2  -60.285482      12.383130  439.118164                3.402925          55.132678            2       True         10
5    KNeighborsDist_BAG_L1  -84.125061       0.103782    0.029244                0.103782           0.029244            1       True          2
6      WeightedEnsemble_L2  -84.125061       0.104525    0.522287                0.000743           0.493043            2       True          9
7    KNeighborsUnif_BAG_L1 -101.546199       0.104947    0.030970                0.104947           0.030970            1       True          1
8   RandomForestMSE_BAG_L1 -116.544294       0.521331   10.616067                0.521331          10.616067            1       True          5
9     ExtraTreesMSE_BAG_L1 -124.588053       0.513939    4.845773                0.513939           4.845773            1       True          7
10         CatBoost_BAG_L1 -130.503441       0.092072  197.782267                0.092072         197.782267            1       True          6
11         LightGBM_BAG_L1 -131.054162       1.289819   29.443498                1.289819          29.443498            1       True          4
12       LightGBMXT_BAG_L1 -131.460909       5.948354   64.125232                5.948354          64.125232            1       True          3
13  NeuralNetFastAI_BAG_L1 -137.591119       0.405960   77.112434                0.405960          77.112434            1       True          8

Create predictions from test dataset¶

In [19]:
predictions = predictor.predict(test)
predictions.head()
Out[19]:
0    23.979008
1    41.106430
2    45.552490
3    48.853279
4    51.996368
Name: count, dtype: float32

NOTE: Kaggle will reject the submission if any predictions are negative, so everything must be set to be ≥ 0.¶
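As an alternative to manually locating and zeroing negatives, pandas' clip can floor the whole series in one step. A minimal sketch on a toy series (the values are made up; the notebook operates on the real predictions series):

```python
import pandas as pd

# Toy predictions including one negative value (hypothetical numbers)
preds = pd.Series([23.9, -4.2, 51.9], name="count")

# Kaggle rejects negative counts, so floor everything at zero
preds = preds.clip(lower=0)

print(preds.tolist())  # -> [23.9, 0.0, 51.9]
```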

In [20]:
# Describe the `predictions` series to see if there are any negative values
predictions.describe()
Out[20]:
count    6493.000000
mean      100.831001
std        89.846375
min         3.146910
25%        20.709923
50%        63.959476
75%       168.133774
max       364.188293
Name: count, dtype: float64
In [21]:
# How many negative values do we have?
predictions_df = pd.DataFrame(predictions)
count_neg = len(predictions_df[predictions_df["count"] < 0])
In [22]:
# Set them to zero
if count_neg > 0:
    predictions_df.loc[predictions_df["count"] < 0, ["count"]] = 0
    print("{} negative predictions were set to zero".format(count_neg))
    print(predictions_df[predictions_df["count"] == 0])
else:
    print("{} negative values were found".format(count_neg))
0 negative values were found

Set predictions to submission dataframe, save, and submit¶

In [23]:
submission["count"] = predictions.round(0).astype(int)
submission.to_csv("submission.csv", index=False)
In [24]:
!kaggle competitions submit -c bike-sharing-demand -f submission.csv -m "first raw submission"
100%|█████████████████████████████████████████| 148k/148k [00:00<00:00, 264kB/s]
Successfully submitted to Bike Sharing Demand

View submissions via the command line or in the web browser under My Submissions on the competition's page¶

In [25]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6
fileName                     date                 description                                   status    publicScore  privateScore  
---------------------------  -------------------  --------------------------------------------  --------  -----------  ------------  
submission.csv               2023-01-04 02:17:18  first raw submission                          complete  1.79200      1.79200       
submission_hpo.csv           2023-01-04 01:59:41  model with new features and hpo               complete  0.47675      0.47675       
submission_hpo.csv           2023-01-04 01:45:44  model with new features and hpo               complete  0.48014      0.48014       
submission_hpo.csv           2023-01-04 01:33:22  model with new features and hpo               complete  0.50426      0.50426       
tail: write error: Broken pipe

Initial score of 1.79200¶

In [26]:
#Score: 1.79200

Step 4: Exploratory Data Analysis and Creating an additional feature¶

  • Any additional feature will do, but a great suggestion would be to separate out the datetime into hour, day, or month parts.
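Beyond hour, other datetime parts can be pulled out the same way with pandas' dt accessor. A minimal sketch on a toy frame (day, month, and dayofweek are optional suggestions, not rubric requirements):

```python
import pandas as pd

# Two hypothetical timestamps in the competition's date range
df = pd.DataFrame({"datetime": pd.to_datetime(["2011-01-01 05:00:00", "2011-07-20 17:00:00"])})

# Each dt accessor returns an integer column aligned with the rows
df["hour"] = df["datetime"].dt.hour
df["day"] = df["datetime"].dt.day
df["month"] = df["datetime"].dt.month
df["dayofweek"] = df["datetime"].dt.dayofweek  # Monday=0 ... Sunday=6

print(df[["hour", "day", "month", "dayofweek"]].values.tolist())
```

The same lines applied to both train and test keep the feature sets consistent, which AutoGluon requires at prediction time.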
In [27]:
# Create a histogram of all features to show the distribution of each one relative to the data. This is part of the exploratory data analysis
train.hist(figsize=(12, 10))
plt.show()
In [28]:
# Create a new feature
train["hour"] = train["datetime"].dt.hour
test["hour"] = test["datetime"].dt.hour
train.head()
Out[28]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count hour
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16 0
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40 1
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32 2
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13 3
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1 4
  • Additional features are explored:
In [29]:
# Profiler report (train data)
profile = ProfileReport(train)
profile.to_notebook_iframe()
Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]
Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]
Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
In [30]:
# Visualizations
# Distribution of hourly bike demand by time features
train.groupby([train["datetime"].dt.month, "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by month (train data)")
train.groupby([train["datetime"].dt.hour, "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by hour (train data)")
train.groupby([train["datetime"].dt.dayofweek, "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by dayofweek (train data)")
plt.show()
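The `groupby(...).median().unstack()` pattern used in these plots pivots the second grouping key into columns; a minimal sketch on toy data:

```python
import pandas as pd

# Two grouping keys: after unstack, "workingday" becomes the columns
# and "hour" stays as the index, giving one bar group per hour
df = pd.DataFrame({
    "hour": [0, 0, 1, 1],
    "workingday": [0, 1, 0, 1],
    "count": [10, 30, 20, 40],
})
pivot = df.groupby(["hour", "workingday"])["count"].median().unstack()
print(pivot.values.tolist())  # [[10.0, 30.0], [20.0, 40.0]]
```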
In [31]:
train.groupby(["holiday"])["count"].median().plot(
    kind='bar', title="Median of hourly bike demand by holiday (train data)")
plt.show()
In [32]:
# Distribution of hourly bike demand by weather features
train.groupby(["season", "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by season (train data)")
train.groupby(["weather", "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by weather (train data)")
train.groupby(["temp", "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by temp (train data)")
train.groupby(["atemp", "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by atemp (train data)")
train.groupby(["windspeed", "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by windspeed (train data)")
train.groupby(["humidity", "workingday"])["count"].median().unstack().plot(
    kind='bar', title="Median of hourly bike demand by humidity (train data)")
plt.show()
In [33]:
# Distribution of events by categorical features
train["season"].value_counts().plot(
    kind='bar', title="Number of events by season (train data)")
plt.show()
train["weather"].value_counts().plot(
    kind='bar', title="Number of events by weather (train data)")
plt.show()
train["holiday"].value_counts().plot(
    kind='bar', title="Number of events by holiday (train data)")
plt.show()
train["workingday"].value_counts().plot(
    kind='bar', title="Number of events by workingday (train data)")
plt.show()
  • Weather values are updated
In [34]:
display(train[train['weather'] == 4])
display(test[test['weather'] == 4])
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count hour
5631 2012-01-09 18:00:00 1 0 1 4 8.2 11.365 86 6.0032 6 158 164 18
datetime season holiday workingday weather temp atemp humidity windspeed hour
154 2011-01-26 16:00:00 1 0 1 4 9.02 9.85 93 22.0028 16
3248 2012-01-21 01:00:00 1 0 0 4 5.74 6.82 86 12.9980 1
In [36]:
# As there are only 3 events in weather category 4 ("heavy rain"), those values are reassigned to category 3 ("light rain")
train.loc[train['weather'] == 4, 'weather'] = 3
test.loc[test['weather'] == 4, 'weather'] = 3
  • Additional features are generated:
In [38]:
# Functions for generating new features values

def get_time_of_day(hour):
    if 7 <= hour <= 9:
        return "morning"
    elif 12 <= hour <= 15:
        return "lunch"
    elif 16 <= hour <= 19:
        return "rush_hour"
    elif 20 <= hour <= 23:
        return "night"
    else:
        return "other"

def get_tempcat(temp):
    if temp >= 35:
        return "very hot"
    elif 25 <= temp < 35:
        return "hot"
    elif 15 <= temp < 25:
        return "warm"
    elif 10 <= temp < 15:
        return "cool"
    else:
        return "cold"

def get_windcat(windspeed):
    if windspeed > 20:
        return "windy"
    elif 10 < windspeed <= 20:
        return "mild"
    else:
        return "low"

def get_humiditycat(humidity):
    if humidity >= 80:
        return "high"
    elif 40 < humidity < 80:
        return "mild"
    else:
        return "low"
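These threshold functions could also be expressed declaratively with `pd.cut`, which vectorizes over the whole column instead of applying a Python function row by row; a sketch mirroring `get_tempcat` (bin edges are the same illustrative thresholds as above):

```python
import pandas as pd

# right=False makes intervals left-inclusive, matching the >= comparisons
# in get_tempcat: [-inf, 10), [10, 15), [15, 25), [25, 35), [35, inf)
temps = pd.Series([5.0, 12.0, 20.0, 30.0, 38.0])
tempcat = pd.cut(
    temps,
    bins=[-float("inf"), 10, 15, 25, 35, float("inf")],
    labels=["cold", "cool", "warm", "hot", "very hot"],
    right=False,
)
print(tempcat.tolist())  # ['cold', 'cool', 'warm', 'hot', 'very hot']
```

A side benefit: `pd.cut` returns a categorical dtype directly, so the later `astype("category")` step is unnecessary for these columns.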
In [39]:
# New features are generated
train["time_of_day"] = train['hour'].apply(get_time_of_day)
test['time_of_day'] = test['hour'].apply(get_time_of_day)
train['atempcat'] = train['atemp'].apply(get_tempcat)
test['atempcat'] = test['atemp'].apply(get_tempcat)
train['tempcat'] = train['temp'].apply(get_tempcat)
test['tempcat'] = test['temp'].apply(get_tempcat)
train['windcat'] = train['windspeed'].apply(get_windcat)
test['windcat'] = test['windspeed'].apply(get_windcat)
train['humiditycat'] = train['humidity'].apply(get_humiditycat)
test['humiditycat'] = test['humidity'].apply(get_humiditycat)
In [40]:
train.head()
Out[40]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count hour time_of_day atempcat tempcat windcat humiditycat
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16 0 other cool cold low high
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40 1 other cool cold low high
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32 2 other cool cold low high
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13 3 other cool cold low mild
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1 4 other cool cold low mild
In [41]:
# Plot new categories
train["time_of_day"].value_counts().plot(
    kind='bar', title="Number of events by time_of_day (train data)")
plt.show()

train["atempcat"].value_counts().plot(
    kind='bar', title="Number of events by atempcat (train data)")
plt.show()

train["tempcat"].value_counts().plot(
    kind='bar', title="Number of events by tempcat (train data)")
plt.show()

train["windcat"].value_counts().plot(
    kind='bar', title="Number of events by windcat (train data)")
plt.show()

train["humiditycat"].value_counts().plot(
    kind='bar', title="Number of events by humiditycat (train data)")
plt.show()

Make category types for these so models know they are not just numbers¶

  • AutoGluon originally sees these as ints, but in reality they are int representations of a category.
  • Setting the dtype to category will classify these as categories in AutoGluon.
In [42]:
category_list = ["season", "weather", "holiday", "workingday"]
train[category_list] = train[category_list].astype("category")
test[category_list] = test[category_list].astype("category")
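A quick sketch of what `astype("category")` does to a column (toy data, illustrative only):

```python
import pandas as pd

# astype("category") changes the dtype so downstream tools treat the
# integer codes as discrete categories rather than ordered numbers
df = pd.DataFrame({"season": [1, 2, 3, 4, 1]})
df["season"] = df["season"].astype("category")
print(df["season"].dtype)  # category
print(list(df["season"].cat.categories))  # [1, 2, 3, 4]
```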
  • Data types for the new features are set:
In [43]:
new_category_list = ["time_of_day", "atempcat", "windcat", "humiditycat", "tempcat"]
train[new_category_list] = train[new_category_list].astype("category")
test[new_category_list] = test[new_category_list].astype("category")
In [44]:
# View the new feature
train.info()
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   datetime     10886 non-null  datetime64[ns]
 1   season       10886 non-null  category      
 2   holiday      10886 non-null  category      
 3   workingday   10886 non-null  category      
 4   weather      10886 non-null  category      
 5   temp         10886 non-null  float64       
 6   atemp        10886 non-null  float64       
 7   humidity     10886 non-null  int64         
 8   windspeed    10886 non-null  float64       
 9   casual       10886 non-null  int64         
 10  registered   10886 non-null  int64         
 11  count        10886 non-null  int64         
 12  hour         10886 non-null  int64         
 13  time_of_day  10886 non-null  category      
 14  atempcat     10886 non-null  category      
 15  tempcat      10886 non-null  category      
 16  windcat      10886 non-null  category      
 17  humiditycat  10886 non-null  category      
dtypes: category(9), datetime64[ns](1), float64(3), int64(5)
memory usage: 862.7 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   datetime     6493 non-null   datetime64[ns]
 1   season       6493 non-null   category      
 2   holiday      6493 non-null   category      
 3   workingday   6493 non-null   category      
 4   weather      6493 non-null   category      
 5   temp         6493 non-null   float64       
 6   atemp        6493 non-null   float64       
 7   humidity     6493 non-null   int64         
 8   windspeed    6493 non-null   float64       
 9   hour         6493 non-null   int64         
 10  time_of_day  6493 non-null   category      
 11  atempcat     6493 non-null   category      
 12  tempcat      6493 non-null   category      
 13  windcat      6493 non-null   category      
 14  humiditycat  6493 non-null   category      
dtypes: category(9), datetime64[ns](1), float64(3), int64(2)
memory usage: 363.0 KB
In [45]:
# View histogram of all features again, now including the new hour feature
train.hist(figsize=(10, 8))
plt.show()
In [46]:
train.head()
Out[46]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count hour time_of_day atempcat tempcat windcat humiditycat
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16 0 other cool cold low high
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40 1 other cool cold low high
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32 2 other cool cold low high
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13 3 other cool cold low mild
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1 4 other cool cold low mild

Step 5: Rerun the model with the same settings as before, just with more features¶

In [47]:
# Fit model
learner_kwargs = {
    "ignored_columns": ["casual", "registered", "atemp", "windspeed", "humidity", "temp"]
}

predictor_new_features = TabularPredictor(label="count", learner_kwargs=learner_kwargs, problem_type="regression", 
    eval_metric="root_mean_squared_error").fit(train_data=train, time_limit=600, presets="best_quality")
No path specified. Models will be saved in: "AutogluonModels/ag-20230104_022345/"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels/ag-20230104_022345/"
AutoGluon Version:  0.6.1
Python Version:     3.7.10
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Oct 26 20:36:53 UTC 2022
Train Data Rows:    10886
Train Data Columns: 17
Label Column: count
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Dropping user-specified ignored columns: ['casual', 'registered', 'atemp', 'windspeed', 'humidity', 'temp']
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    1929.77 MB
	Train Data (Original)  Memory Usage: 0.27 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 2 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
		Fitting DatetimeFeatureGenerator...
/usr/local/lib/python3.7/site-packages/autogluon/features/generators/datetime.py:59: FutureWarning: casting datetime64[ns, UTC] values to int64 with .astype(...) is deprecated and will raise in a future version. Use .view(...) instead.
  good_rows = series[~series.isin(bad_rows)].astype(np.int64)
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('category', []) : 9 | ['season', 'holiday', 'workingday', 'weather', 'time_of_day', ...]
		('datetime', []) : 1 | ['datetime']
		('int', [])      : 1 | ['hour']
	Types of features in processed data (raw dtype, special dtypes):
		('category', [])             : 7 | ['season', 'weather', 'time_of_day', 'atempcat', 'tempcat', ...]
		('int', [])                  : 1 | ['hour']
		('int', ['bool'])            : 2 | ['holiday', 'workingday']
		('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
	0.3s = Fit runtime
	11 features in original data used to generate 15 features in processed data.
	Train Data (Processed) Memory Usage: 0.62 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.36s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
	This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
	To change this, specify the eval_metric parameter of Predictor()
AutoGluon will fit 2 stack levels (L1 to L2) ...
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ... Training model for up to 399.66s of the 599.63s of remaining time.
	-101.5462	 = Validation score   (-root_mean_squared_error)
	0.02s	 = Training   runtime
	0.1s	 = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ... Training model for up to 399.31s of the 599.29s of remaining time.
	-84.1251	 = Validation score   (-root_mean_squared_error)
	0.02s	 = Training   runtime
	0.1s	 = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ... Training model for up to 398.97s of the 598.94s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-35.8791	 = Validation score   (-root_mean_squared_error)
	77.22s	 = Training   runtime
	7.39s	 = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 313.14s of the 513.11s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-32.9854	 = Validation score   (-root_mean_squared_error)
	54.48s	 = Training   runtime
	4.5s	 = Validation runtime
Fitting model: RandomForestMSE_BAG_L1 ... Training model for up to 253.71s of the 453.68s of remaining time.
	-39.1691	 = Validation score   (-root_mean_squared_error)
	9.88s	 = Training   runtime
	0.57s	 = Validation runtime
Fitting model: CatBoost_BAG_L1 ... Training model for up to 240.74s of the 440.71s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-35.9861	 = Validation score   (-root_mean_squared_error)
	205.83s	 = Training   runtime
	0.21s	 = Validation runtime
Fitting model: ExtraTreesMSE_BAG_L1 ... Training model for up to 30.9s of the 230.87s of remaining time.
	-38.8958	 = Validation score   (-root_mean_squared_error)
	5.56s	 = Training   runtime
	0.56s	 = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ... Training model for up to 22.2s of the 222.17s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-75.0209	 = Validation score   (-root_mean_squared_error)
	44.2s	 = Training   runtime
	0.49s	 = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 173.69s of remaining time.
	-32.2135	 = Validation score   (-root_mean_squared_error)
	0.69s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting 9 L2 models ...
Fitting model: LightGBMXT_BAG_L2 ... Training model for up to 172.92s of the 172.9s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-31.9977	 = Validation score   (-root_mean_squared_error)
	28.23s	 = Training   runtime
	0.65s	 = Validation runtime
Fitting model: LightGBM_BAG_L2 ... Training model for up to 140.29s of the 140.27s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-30.7289	 = Validation score   (-root_mean_squared_error)
	25.28s	 = Training   runtime
	0.28s	 = Validation runtime
Fitting model: RandomForestMSE_BAG_L2 ... Training model for up to 110.94s of the 110.92s of remaining time.
	-32.1346	 = Validation score   (-root_mean_squared_error)
	26.81s	 = Training   runtime
	0.61s	 = Validation runtime
Fitting model: CatBoost_BAG_L2 ... Training model for up to 81.22s of the 81.2s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-31.1785	 = Validation score   (-root_mean_squared_error)
	77.67s	 = Training   runtime
	0.14s	 = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L3 ... Training model for up to 360.0s of the -0.38s of remaining time.
	-30.5547	 = Validation score   (-root_mean_squared_error)
	0.33s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 600.91s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230104_022345/")
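The `ignored_columns` setting above keeps leakage features like `casual` and `registered` (which sum to the target `count` and are absent from the test set) out of training. A framework-agnostic pandas sketch of the same idea, on toy data:

```python
import pandas as pd

# Toy training frame: casual + registered = count, so keeping them
# would let the model trivially reconstruct the target (leakage)
train_df = pd.DataFrame({
    "count": [16, 40], "casual": [3, 8], "registered": [13, 32], "hour": [0, 1]
})
ignored = ["casual", "registered"]
features = train_df.drop(columns=ignored)
print(list(features.columns))  # ['count', 'hour']
```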
In [48]:
# Get detailed info of the predictor
pred_nf_info = predictor_new_features.info()
with open('docs/pred_nf_info.json', 'w') as convert_file:
	convert_file.write(json.dumps(pred_nf_info, default=str))
In [49]:
predictor_new_features.fit_summary(show_plot=False)
*** Summary of fit() ***
Estimated performance of each model:
                     model   score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0      WeightedEnsemble_L3  -30.554656      15.607355  555.528752                0.001309           0.329112            3       True         14
1          LightGBM_BAG_L2  -30.728931      14.204675  422.493474                0.284346          25.281531            2       True         11
2          CatBoost_BAG_L2  -31.178451      14.059996  474.882658                0.139667          77.670714            2       True         13
3        LightGBMXT_BAG_L2  -31.997696      14.572388  425.438654                0.652059          28.226711            2       True         10
4   RandomForestMSE_BAG_L2  -32.134557      14.529974  424.020685                0.609645          26.808742            2       True         12
5      WeightedEnsemble_L2  -32.213506      12.771409  348.123248                0.001235           0.694335            2       True          9
6          LightGBM_BAG_L1  -32.985441       4.500878   54.476392                4.500878          54.476392            1       True          4
7        LightGBMXT_BAG_L1  -35.879059       7.386585   77.218651                7.386585          77.218651            1       True          3
8          CatBoost_BAG_L1  -35.986122       0.213159  205.830238                0.213159         205.830238            1       True          6
9     ExtraTreesMSE_BAG_L1  -38.895772       0.558258    5.564719                0.558258           5.564719            1       True          7
10  RandomForestMSE_BAG_L1  -39.169105       0.565517    9.879488                0.565517           9.879488            1       True          5
11  NeuralNetFastAI_BAG_L1  -75.020906       0.488890   44.199071                0.488890          44.199071            1       True          8
12   KNeighborsDist_BAG_L1  -84.125061       0.104034    0.024144                0.104034           0.024144            1       True          2
13   KNeighborsUnif_BAG_L1 -101.546199       0.103008    0.019241                0.103008           0.019241            1       True          1
Number of models trained: 14
Types of models trained:
{'StackerEnsembleModel_KNN', 'StackerEnsembleModel_LGB', 'StackerEnsembleModel_XT', 'StackerEnsembleModel_NNFastAiTabular', 'WeightedEnsembleModel', 'StackerEnsembleModel_CatBoost', 'StackerEnsembleModel_RF'}
Bagging used: True  (with 8 folds)
Multi-layer stack-ensembling used: True  (with 3 levels)
Feature Metadata (Processed):
(raw dtype, special dtypes):
('category', [])             : 7 | ['season', 'weather', 'time_of_day', 'atempcat', 'tempcat', ...]
('int', [])                  : 1 | ['hour']
('int', ['bool'])            : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
Plot summary of models saved to file: AutogluonModels/ag-20230104_022345/SummaryOfModels.html
*** End of fit() summary ***
Out[49]:
{'model_types': {'KNeighborsUnif_BAG_L1': 'StackerEnsembleModel_KNN',
  'KNeighborsDist_BAG_L1': 'StackerEnsembleModel_KNN',
  'LightGBMXT_BAG_L1': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
  'RandomForestMSE_BAG_L1': 'StackerEnsembleModel_RF',
  'CatBoost_BAG_L1': 'StackerEnsembleModel_CatBoost',
  'ExtraTreesMSE_BAG_L1': 'StackerEnsembleModel_XT',
  'NeuralNetFastAI_BAG_L1': 'StackerEnsembleModel_NNFastAiTabular',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel',
  'LightGBMXT_BAG_L2': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L2': 'StackerEnsembleModel_LGB',
  'RandomForestMSE_BAG_L2': 'StackerEnsembleModel_RF',
  'CatBoost_BAG_L2': 'StackerEnsembleModel_CatBoost',
  'WeightedEnsemble_L3': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighborsUnif_BAG_L1': -101.54619908446061,
  'KNeighborsDist_BAG_L1': -84.12506123181602,
  'LightGBMXT_BAG_L1': -35.87905943622523,
  'LightGBM_BAG_L1': -32.98544134736175,
  'RandomForestMSE_BAG_L1': -39.16910479120037,
  'CatBoost_BAG_L1': -35.98612160843292,
  'ExtraTreesMSE_BAG_L1': -38.89577199128411,
  'NeuralNetFastAI_BAG_L1': -75.02090615783968,
  'WeightedEnsemble_L2': -32.213505766365564,
  'LightGBMXT_BAG_L2': -31.997695966545237,
  'LightGBM_BAG_L2': -30.728930885149015,
  'RandomForestMSE_BAG_L2': -32.13455742507246,
  'CatBoost_BAG_L2': -31.17845121244172,
  'WeightedEnsemble_L3': -30.554655562394768},
 'model_best': 'WeightedEnsemble_L3',
 'model_paths': {'KNeighborsUnif_BAG_L1': 'AutogluonModels/ag-20230104_022345/models/KNeighborsUnif_BAG_L1/',
  'KNeighborsDist_BAG_L1': 'AutogluonModels/ag-20230104_022345/models/KNeighborsDist_BAG_L1/',
  'LightGBMXT_BAG_L1': 'AutogluonModels/ag-20230104_022345/models/LightGBMXT_BAG_L1/',
  'LightGBM_BAG_L1': 'AutogluonModels/ag-20230104_022345/models/LightGBM_BAG_L1/',
  'RandomForestMSE_BAG_L1': 'AutogluonModels/ag-20230104_022345/models/RandomForestMSE_BAG_L1/',
  'CatBoost_BAG_L1': 'AutogluonModels/ag-20230104_022345/models/CatBoost_BAG_L1/',
  'ExtraTreesMSE_BAG_L1': 'AutogluonModels/ag-20230104_022345/models/ExtraTreesMSE_BAG_L1/',
  'NeuralNetFastAI_BAG_L1': 'AutogluonModels/ag-20230104_022345/models/NeuralNetFastAI_BAG_L1/',
  'WeightedEnsemble_L2': 'AutogluonModels/ag-20230104_022345/models/WeightedEnsemble_L2/',
  'LightGBMXT_BAG_L2': 'AutogluonModels/ag-20230104_022345/models/LightGBMXT_BAG_L2/',
  'LightGBM_BAG_L2': 'AutogluonModels/ag-20230104_022345/models/LightGBM_BAG_L2/',
  'RandomForestMSE_BAG_L2': 'AutogluonModels/ag-20230104_022345/models/RandomForestMSE_BAG_L2/',
  'CatBoost_BAG_L2': 'AutogluonModels/ag-20230104_022345/models/CatBoost_BAG_L2/',
  'WeightedEnsemble_L3': 'AutogluonModels/ag-20230104_022345/models/WeightedEnsemble_L3/'},
 'model_fit_times': {'KNeighborsUnif_BAG_L1': 0.019240617752075195,
  'KNeighborsDist_BAG_L1': 0.02414417266845703,
  'LightGBMXT_BAG_L1': 77.2186508178711,
  'LightGBM_BAG_L1': 54.476391553878784,
  'RandomForestMSE_BAG_L1': 9.879487752914429,
  'CatBoost_BAG_L1': 205.83023834228516,
  'ExtraTreesMSE_BAG_L1': 5.564719200134277,
  'NeuralNetFastAI_BAG_L1': 44.19907069206238,
  'WeightedEnsemble_L2': 0.6943349838256836,
  'LightGBMXT_BAG_L2': 28.226710557937622,
  'LightGBM_BAG_L2': 25.281530618667603,
  'RandomForestMSE_BAG_L2': 26.8087420463562,
  'CatBoost_BAG_L2': 77.67071437835693,
  'WeightedEnsemble_L3': 0.32911157608032227},
 'model_pred_times': {'KNeighborsUnif_BAG_L1': 0.10300803184509277,
  'KNeighborsDist_BAG_L1': 0.1040341854095459,
  'LightGBMXT_BAG_L1': 7.386584997177124,
  'LightGBM_BAG_L1': 4.500877857208252,
  'RandomForestMSE_BAG_L1': 0.5655171871185303,
  'CatBoost_BAG_L1': 0.21315884590148926,
  'ExtraTreesMSE_BAG_L1': 0.5582578182220459,
  'NeuralNetFastAI_BAG_L1': 0.4888899326324463,
  'WeightedEnsemble_L2': 0.0012354850769042969,
  'LightGBMXT_BAG_L2': 0.6520588397979736,
  'LightGBM_BAG_L2': 0.2843458652496338,
  'RandomForestMSE_BAG_L2': 0.6096453666687012,
  'CatBoost_BAG_L2': 0.13966703414916992,
  'WeightedEnsemble_L3': 0.0013093948364257812},
 'num_bag_folds': 8,
 'max_stack_level': 3,
 'model_hyperparams': {'KNeighborsUnif_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'KNeighborsDist_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'LightGBMXT_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'RandomForestMSE_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'CatBoost_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'ExtraTreesMSE_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'NeuralNetFastAI_BAG_L1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'WeightedEnsemble_L2': {'use_orig_features': False,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBMXT_BAG_L2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'RandomForestMSE_BAG_L2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True,
   'use_child_oof': True},
  'CatBoost_BAG_L2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'WeightedEnsemble_L3': {'use_orig_features': False,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True}},
 'leaderboard':                      model   score_val  pred_time_val    fit_time  \
 0      WeightedEnsemble_L3  -30.554656      15.607355  555.528752   
 1          LightGBM_BAG_L2  -30.728931      14.204675  422.493474   
 2          CatBoost_BAG_L2  -31.178451      14.059996  474.882658   
 3        LightGBMXT_BAG_L2  -31.997696      14.572388  425.438654   
 4   RandomForestMSE_BAG_L2  -32.134557      14.529974  424.020685   
 5      WeightedEnsemble_L2  -32.213506      12.771409  348.123248   
 6          LightGBM_BAG_L1  -32.985441       4.500878   54.476392   
 7        LightGBMXT_BAG_L1  -35.879059       7.386585   77.218651   
 8          CatBoost_BAG_L1  -35.986122       0.213159  205.830238   
 9     ExtraTreesMSE_BAG_L1  -38.895772       0.558258    5.564719   
 10  RandomForestMSE_BAG_L1  -39.169105       0.565517    9.879488   
 11  NeuralNetFastAI_BAG_L1  -75.020906       0.488890   44.199071   
 12   KNeighborsDist_BAG_L1  -84.125061       0.104034    0.024144   
 13   KNeighborsUnif_BAG_L1 -101.546199       0.103008    0.019241   
 
     pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  \
 0                 0.001309           0.329112            3       True   
 1                 0.284346          25.281531            2       True   
 2                 0.139667          77.670714            2       True   
 3                 0.652059          28.226711            2       True   
 4                 0.609645          26.808742            2       True   
 5                 0.001235           0.694335            2       True   
 6                 4.500878          54.476392            1       True   
 7                 7.386585          77.218651            1       True   
 8                 0.213159         205.830238            1       True   
 9                 0.558258           5.564719            1       True   
 10                0.565517           9.879488            1       True   
 11                0.488890          44.199071            1       True   
 12                0.104034           0.024144            1       True   
 13                0.103008           0.019241            1       True   
 
     fit_order  
 0          14  
 1          11  
 2          13  
 3          10  
 4          12  
 5           9  
 6           4  
 7           3  
 8           6  
 9           7  
 10          5  
 11          8  
 12          2  
 13          1  }
In [50]:
predictor_new_features.leaderboard(silent=True).plot(kind="bar", x="model", y="score_val")
Out[50]:
<AxesSubplot:xlabel='model'>
In [51]:
# Save validation scores
leaderboard_nf = predictor_new_features.leaderboard()
leaderboard_nf["description"] = "scores with new features"
leaderboard_nf.to_csv("docs/leaderboard_nf.csv", index=False)
                     model   score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0      WeightedEnsemble_L3  -30.554656      15.607355  555.528752                0.001309           0.329112            3       True         14
1          LightGBM_BAG_L2  -30.728931      14.204675  422.493474                0.284346          25.281531            2       True         11
2          CatBoost_BAG_L2  -31.178451      14.059996  474.882658                0.139667          77.670714            2       True         13
3        LightGBMXT_BAG_L2  -31.997696      14.572388  425.438654                0.652059          28.226711            2       True         10
4   RandomForestMSE_BAG_L2  -32.134557      14.529974  424.020685                0.609645          26.808742            2       True         12
5      WeightedEnsemble_L2  -32.213506      12.771409  348.123248                0.001235           0.694335            2       True          9
6          LightGBM_BAG_L1  -32.985441       4.500878   54.476392                4.500878          54.476392            1       True          4
7        LightGBMXT_BAG_L1  -35.879059       7.386585   77.218651                7.386585          77.218651            1       True          3
8          CatBoost_BAG_L1  -35.986122       0.213159  205.830238                0.213159         205.830238            1       True          6
9     ExtraTreesMSE_BAG_L1  -38.895772       0.558258    5.564719                0.558258           5.564719            1       True          7
10  RandomForestMSE_BAG_L1  -39.169105       0.565517    9.879488                0.565517           9.879488            1       True          5
11  NeuralNetFastAI_BAG_L1  -75.020906       0.488890   44.199071                0.488890          44.199071            1       True          8
12   KNeighborsDist_BAG_L1  -84.125061       0.104034    0.024144                0.104034           0.024144            1       True          2
13   KNeighborsUnif_BAG_L1 -101.546199       0.103008    0.019241                0.103008           0.019241            1       True          1
In [52]:
predictions_nf = predictor_new_features.predict(test)
predictions_nf.head()
Out[52]:
0    15.458272
1    11.300541
2    10.280220
3     9.255146
4     8.010801
Name: count, dtype: float32
In [53]:
predictions_nf.describe()
Out[53]:
count    6493.000000
mean      154.990692
std       134.164993
min         2.731123
25%        50.973583
50%       121.053001
75%       217.896561
max       789.729919
Name: count, dtype: float64
In [54]:
# Remember to set all negative values to zero
predictions_nf_df = pd.DataFrame(predictions_nf)
count_neg = len(predictions_nf_df[predictions_nf_df["count"] < 0])

if count_neg > 0:
    predictions_nf_df.loc[predictions_nf_df["count"] < 0, ["count"]] = 0
    print("{} negative predictions were set to zero".format(count_neg))
    print(predictions_nf_df[predictions_nf_df["count"] == 0])
else:
    print("{} negative values were found".format(count_neg))
0 negative values were found
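As an aside, the same clamp can be written in one call with pandas' `clip`; a minimal sketch on a toy series (the variable here is a stand-in, not a name from this notebook):

```python
import pandas as pd

# Toy stand-in for a prediction series like the one above
preds = pd.Series([15.4, -2.1, 9.2, -0.3], name="count")

# clip(lower=0) replaces every negative value with 0 in a single call
clamped = preds.clip(lower=0)
print(clamped.tolist())  # [15.4, 0.0, 9.2, 0.0]
```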
In [55]:
submission_new_features = pd.read_csv("sampleSubmission.csv", parse_dates=["datetime"])
submission_new_features.head()
Out[55]:
datetime count
0 2011-01-20 00:00:00 0
1 2011-01-20 01:00:00 0
2 2011-01-20 02:00:00 0
3 2011-01-20 03:00:00 0
4 2011-01-20 04:00:00 0
In [56]:
# Save predictions for submission
submission_new_features["count"] = predictions_nf.round(0).astype(int)
submission_new_features.to_csv("submission_new_features.csv", index=False)
In [57]:
!kaggle competitions submit -c bike-sharing-demand -f submission_new_features.csv -m "model with new features"
100%|█████████████████████████████████████████| 149k/149k [00:00<00:00, 279kB/s]
Successfully submitted to Bike Sharing Demand
In [58]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6
fileName                     date                 description                                   status    publicScore  privateScore  
---------------------------  -------------------  --------------------------------------------  --------  -----------  ------------  
submission_new_features.csv  2023-01-04 02:38:24  model with new features                       complete  0.65341      0.65341       
submission.csv               2023-01-04 02:17:18  first raw submission                          complete  1.79200      1.79200       
submission_hpo.csv           2023-01-04 01:59:41  model with new features and hpo               complete  0.47675      0.47675       
submission_hpo.csv           2023-01-04 01:45:44  model with new features and hpo               complete  0.48014      0.48014       
tail: write error: Broken pipe

New Score of 0.65341¶

In [59]:
#Score with one additional feature (hour): 0.67642
#Score with more features: 0.65341
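For context on what these Kaggle scores mean: this competition is evaluated with Root Mean Squared Logarithmic Error (RMSLE), so lower is better. A small helper for sanity-checking predictions locally (a hypothetical function, not part of the starter code):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error: penalizes relative error,
    so predicting 5 when the truth is 10 costs about as much as
    predicting 50 when the truth is 100."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

print(rmsle([10, 20, 30], [10, 20, 30]))  # 0.0 for a perfect prediction
```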

Step 6: Hyperparameter optimization¶

  • There are many options for hyperparameter optimization.
  • You can tune either AutoGluon's higher-level parameters or the hyperparameters of the individual models.
  • Tuning the individual models' hyperparameters requires the hyperparameters and hyperparameter_tune_kwargs arguments to fit().
In [60]:
import autogluon.core as ag

# high-level parameters:
# num_stack_levels: maximum possible is 3.
# num_bag_folds: values between 5 and 10 are recommended by AutoGluon (default = 5 with the best_quality preset, i.e. {'auto_stack': True}).
# num_bag_sets: maximum possible is 20 when time_limit is set.
        
# hyperparameters
gbm_options = {  
    'num_boost_round': 200,  
    'num_leaves': ag.space.Int(lower=26, upper=66, default=36), 
    'learning_rate' : 0.03,
}

cat_options = {
    'iterations' : 10000,
    'learning_rate' : 0.03,
    'depth' : ag.space.Int(lower=2, upper=8, default=6)
}

hyperparameters = { 
                   'GBM': gbm_options,
                   'CAT': cat_options, 
                  }  


# hyperparameter_tune_kwargs
num_trials = 5  # Effective number of trials is constrained by time_limit.

hyperparameter_tune_kwargs = {  # HPO is not performed unless hyperparameter_tune_kwargs is specified
    'num_trials': num_trials,
    'scheduler' : 'local',
    'searcher': 'auto', # AutoGluon performs a random search
}

learner_kwargs = {
    "ignored_columns": ["casual", "registered", "atemp", "windspeed", "humidity", "temp"]
}

predictor_hpo = TabularPredictor(label="count", learner_kwargs=learner_kwargs, problem_type="regression", 
    eval_metric="root_mean_squared_error").fit(
        train_data=train, 
        time_limit=600,
        num_stack_levels=3, 
        num_bag_folds=10, 
        num_bag_sets=20,
        hyperparameters=hyperparameters,
        hyperparameter_tune_kwargs=hyperparameter_tune_kwargs
        )
No path specified. Models will be saved in: "AutogluonModels/ag-20230104_025616/"
Warning: hyperparameter tuning is currently experimental and may cause the process to hang.
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels/ag-20230104_025616/"
AutoGluon Version:  0.6.1
Python Version:     3.7.10
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Oct 26 20:36:53 UTC 2022
Train Data Rows:    10886
Train Data Columns: 17
Label Column: count
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Dropping user-specified ignored columns: ['casual', 'registered', 'atemp', 'windspeed', 'humidity', 'temp']
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    1879.08 MB
	Train Data (Original)  Memory Usage: 0.27 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 2 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
		Fitting DatetimeFeatureGenerator...
/usr/local/lib/python3.7/site-packages/autogluon/features/generators/datetime.py:59: FutureWarning: casting datetime64[ns, UTC] values to int64 with .astype(...) is deprecated and will raise in a future version. Use .view(...) instead.
  good_rows = series[~series.isin(bad_rows)].astype(np.int64)
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('category', []) : 9 | ['season', 'holiday', 'workingday', 'weather', 'time_of_day', ...]
		('datetime', []) : 1 | ['datetime']
		('int', [])      : 1 | ['hour']
	Types of features in processed data (raw dtype, special dtypes):
		('category', [])             : 7 | ['season', 'weather', 'time_of_day', 'atempcat', 'tempcat', ...]
		('int', [])                  : 1 | ['hour']
		('int', ['bool'])            : 2 | ['holiday', 'workingday']
		('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
	0.2s = Fit runtime
	11 features in original data used to generate 15 features in processed data.
	Train Data (Processed) Memory Usage: 0.62 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.23s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
	This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
	To change this, specify the eval_metric parameter of Predictor()
AutoGluon will fit 4 stack levels (L1 to L4) ...
Fitting 2 L1 models ...
Hyperparameter tuning model: LightGBM_BAG_L1 ... Tuning model for up to 89.94s of the 599.76s of remaining time.
  0%|          | 0/5 [00:00<?, ?it/s]
	Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
	Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
	Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
	Stopping HPO to satisfy time limit...
Fitted model: LightGBM_BAG_L1/T1 ...
	-39.6668	 = Validation score   (-root_mean_squared_error)
	30.37s	 = Training   runtime
	0.0s	 = Validation runtime
Fitted model: LightGBM_BAG_L1/T2 ...
	-41.5783	 = Validation score   (-root_mean_squared_error)
	30.09s	 = Training   runtime
	0.0s	 = Validation runtime
Fitted model: LightGBM_BAG_L1/T3 ...
	-36.7635	 = Validation score   (-root_mean_squared_error)
	32.56s	 = Training   runtime
	0.0s	 = Validation runtime
Hyperparameter tuning model: CatBoost_BAG_L1 ... Tuning model for up to 89.94s of the 506.48s of remaining time.
  0%|          | 0/5 [00:00<?, ?it/s]
	Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
	Stopping HPO to satisfy time limit...
Fitted model: CatBoost_BAG_L1/T1 ...
	-43.2934	 = Validation score   (-root_mean_squared_error)
	89.41s	 = Training   runtime
	0.0s	 = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 416.91s of remaining time.
	-36.754	 = Validation score   (-root_mean_squared_error)
	0.29s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting 2 L2 models ...
Hyperparameter tuning model: LightGBM_BAG_L2 ... Tuning model for up to 83.29s of the 416.52s of remaining time.
  0%|          | 0/5 [00:00<?, ?it/s]
	Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
	Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
	Stopping HPO to satisfy time limit...
Fitted model: LightGBM_BAG_L2/T1 ...
	-36.4531	 = Validation score   (-root_mean_squared_error)
	31.57s	 = Training   runtime
	0.0s	 = Validation runtime
Fitted model: LightGBM_BAG_L2/T2 ...
	-36.4951	 = Validation score   (-root_mean_squared_error)
	32.42s	 = Training   runtime
	0.0s	 = Validation runtime
Hyperparameter tuning model: CatBoost_BAG_L2 ... Tuning model for up to 83.29s of the 352.34s of remaining time.
  0%|          | 0/5 [00:00<?, ?it/s]
	Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
	Stopping HPO to satisfy time limit...
Fitted model: CatBoost_BAG_L2/T1 ...
	-37.1213	 = Validation score   (-root_mean_squared_error)
	84.68s	 = Training   runtime
	0.0s	 = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L3 ... Training model for up to 360.0s of the 267.48s of remaining time.
	-36.3611	 = Validation score   (-root_mean_squared_error)
	0.23s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting 2 L3 models ...
Hyperparameter tuning model: LightGBM_BAG_L3 ... Tuning model for up to 80.13s of the 267.16s of remaining time.
  0%|          | 0/5 [00:00<?, ?it/s]
	Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
	Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
	Stopping HPO to satisfy time limit...
Fitted model: LightGBM_BAG_L3/T1 ...
	-37.1023	 = Validation score   (-root_mean_squared_error)
	31.51s	 = Training   runtime
	0.0s	 = Validation runtime
Fitted model: LightGBM_BAG_L3/T2 ...
	-36.9997	 = Validation score   (-root_mean_squared_error)
	30.9s	 = Training   runtime
	0.0s	 = Validation runtime
Hyperparameter tuning model: CatBoost_BAG_L3 ... Tuning model for up to 80.13s of the 204.54s of remaining time.
  0%|          | 0/5 [00:00<?, ?it/s]
	Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
	Stopping HPO to satisfy time limit...
Fitted model: CatBoost_BAG_L3/T1 ...
	-36.6865	 = Validation score   (-root_mean_squared_error)
	81.55s	 = Training   runtime
	0.0s	 = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L4 ... Training model for up to 360.0s of the 122.83s of remaining time.
	-36.664	 = Validation score   (-root_mean_squared_error)
	0.23s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting 2 L4 models ...
Hyperparameter tuning model: LightGBM_BAG_L4 ... Tuning model for up to 55.14s of the 122.51s of remaining time.
  0%|          | 0/5 [00:00<?, ?it/s]
	Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
	Stopping HPO to satisfy time limit...
Fitted model: LightGBM_BAG_L4/T1 ...
	-37.4417	 = Validation score   (-root_mean_squared_error)
	30.68s	 = Training   runtime
	0.0s	 = Validation runtime
Hyperparameter tuning model: CatBoost_BAG_L4 ... Tuning model for up to 55.14s of the 91.67s of remaining time.
  0%|          | 0/5 [00:00<?, ?it/s]
	Fitting 10 child models (S1F1 - S1F10) | Fitting with ParallelLocalFoldFittingStrategy
	Stopping HPO to satisfy time limit...
Fitted model: CatBoost_BAG_L4/T1 ...
	-37.139	 = Validation score   (-root_mean_squared_error)
	61.62s	 = Training   runtime
	0.0s	 = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L5 ... Training model for up to 360.0s of the 29.88s of remaining time.
	-37.1079	 = Validation score   (-root_mean_squared_error)
	0.18s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 570.51s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230104_025616/")
In [61]:
#predictor_hpo = TabularPredictor.load("AutogluonModels/.../")
In [62]:
# Get detailed info of the predictor
pred_hpo_info = predictor_hpo.info()
with open('docs/pred_hpo_info.json', 'w') as convert_file:
	convert_file.write(json.dumps(pred_hpo_info, default=str))
In [63]:
predictor_hpo.fit_summary(show_plot=False)
*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L3 -36.361050       0.001702  331.316357                0.000933           0.233781            3       True          9
1    LightGBM_BAG_L2/T1 -36.453077       0.000551  213.984909                0.000095          31.565502            2       True          6
2    LightGBM_BAG_L2/T2 -36.495067       0.000588  214.839080                0.000131          32.419674            2       True          7
3   WeightedEnsemble_L4 -36.663953       0.002069  475.266736                0.000921           0.228529            4       True         13
4    CatBoost_BAG_L3/T1 -36.686524       0.000893  412.627928                0.000124          81.545351            3       True         12
5   WeightedEnsemble_L2 -36.753990       0.000954  122.255102                0.000747           0.292852            2       True          5
6    LightGBM_BAG_L1/T3 -36.763544       0.000098   32.556703                0.000098          32.556703            1       True          3
7    LightGBM_BAG_L3/T2 -36.999679       0.000899  361.981402                0.000131          30.898825            3       True         11
8    LightGBM_BAG_L3/T1 -37.102334       0.000894  362.594031                0.000125          31.511455            3       True         10
9   WeightedEnsemble_L5 -37.107944       0.002198  567.524043                0.000823           0.179885            5       True         16
10   CatBoost_BAG_L2/T1 -37.121271       0.000542  267.097401                0.000086          84.677994            2       True          8
11   CatBoost_BAG_L4/T1 -37.139012       0.001281  536.662032                0.000132          61.623824            4       True         15
12   LightGBM_BAG_L4/T1 -37.441667       0.001242  505.720334                0.000093          30.682127            4       True         14
13   LightGBM_BAG_L1/T1 -39.666828       0.000154   30.368811                0.000154          30.368811            1       True          1
14   LightGBM_BAG_L1/T2 -41.578265       0.000095   30.088346                0.000095          30.088346            1       True          2
15   CatBoost_BAG_L1/T1 -43.293445       0.000110   89.405546                0.000110          89.405546            1       True          4
Number of models trained: 16
Types of models trained:
{'StackerEnsembleModel_LGB', 'StackerEnsembleModel_CatBoost', 'WeightedEnsembleModel'}
Bagging used: True  (with 10 folds)
Multi-layer stack-ensembling used: True  (with 5 levels)
Feature Metadata (Processed):
(raw dtype, special dtypes):
('category', [])             : 7 | ['season', 'weather', 'time_of_day', 'atempcat', 'tempcat', ...]
('int', [])                  : 1 | ['hour']
('int', ['bool'])            : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
Plot summary of models saved to file: AutogluonModels/ag-20230104_025616/SummaryOfModels.html
*** End of fit() summary ***
Out[63]:
{'model_types': {'LightGBM_BAG_L1/T1': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L1/T2': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L1/T3': 'StackerEnsembleModel_LGB',
  'CatBoost_BAG_L1/T1': 'StackerEnsembleModel_CatBoost',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel',
  'LightGBM_BAG_L2/T1': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L2/T2': 'StackerEnsembleModel_LGB',
  'CatBoost_BAG_L2/T1': 'StackerEnsembleModel_CatBoost',
  'WeightedEnsemble_L3': 'WeightedEnsembleModel',
  'LightGBM_BAG_L3/T1': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L3/T2': 'StackerEnsembleModel_LGB',
  'CatBoost_BAG_L3/T1': 'StackerEnsembleModel_CatBoost',
  'WeightedEnsemble_L4': 'WeightedEnsembleModel',
  'LightGBM_BAG_L4/T1': 'StackerEnsembleModel_LGB',
  'CatBoost_BAG_L4/T1': 'StackerEnsembleModel_CatBoost',
  'WeightedEnsemble_L5': 'WeightedEnsembleModel'},
 'model_performance': {'LightGBM_BAG_L1/T1': -39.666828152831606,
  'LightGBM_BAG_L1/T2': -41.578265197796064,
  'LightGBM_BAG_L1/T3': -36.76354412409187,
  'CatBoost_BAG_L1/T1': -43.29344521141029,
  'WeightedEnsemble_L2': -36.75399003349142,
  'LightGBM_BAG_L2/T1': -36.45307681065257,
  'LightGBM_BAG_L2/T2': -36.49506683889113,
  'CatBoost_BAG_L2/T1': -37.12127084525256,
  'WeightedEnsemble_L3': -36.361050242497605,
  'LightGBM_BAG_L3/T1': -37.10233449608509,
  'LightGBM_BAG_L3/T2': -36.99967884689942,
  'CatBoost_BAG_L3/T1': -36.686524245722254,
  'WeightedEnsemble_L4': -36.66395318572601,
  'LightGBM_BAG_L4/T1': -37.44166702766272,
  'CatBoost_BAG_L4/T1': -37.13901210163851,
  'WeightedEnsemble_L5': -37.10794399329098},
 'model_best': 'WeightedEnsemble_L3',
 'model_paths': {'LightGBM_BAG_L1/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/LightGBM_BAG_L1/T1/',
  'LightGBM_BAG_L1/T2': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/LightGBM_BAG_L1/T2/',
  'LightGBM_BAG_L1/T3': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/LightGBM_BAG_L1/T3/',
  'CatBoost_BAG_L1/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/CatBoost_BAG_L1/T1/',
  'WeightedEnsemble_L2': 'AutogluonModels/ag-20230104_025616/models/WeightedEnsemble_L2/',
  'LightGBM_BAG_L2/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/LightGBM_BAG_L2/T1/',
  'LightGBM_BAG_L2/T2': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/LightGBM_BAG_L2/T2/',
  'CatBoost_BAG_L2/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/CatBoost_BAG_L2/T1/',
  'WeightedEnsemble_L3': 'AutogluonModels/ag-20230104_025616/models/WeightedEnsemble_L3/',
  'LightGBM_BAG_L3/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/LightGBM_BAG_L3/T1/',
  'LightGBM_BAG_L3/T2': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/LightGBM_BAG_L3/T2/',
  'CatBoost_BAG_L3/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/CatBoost_BAG_L3/T1/',
  'WeightedEnsemble_L4': 'AutogluonModels/ag-20230104_025616/models/WeightedEnsemble_L4/',
  'LightGBM_BAG_L4/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/LightGBM_BAG_L4/T1/',
  'CatBoost_BAG_L4/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20230104_025616/models/CatBoost_BAG_L4/T1/',
  'WeightedEnsemble_L5': 'AutogluonModels/ag-20230104_025616/models/WeightedEnsemble_L5/'},
 'model_fit_times': {'LightGBM_BAG_L1/T1': 30.36881136894226,
  'LightGBM_BAG_L1/T2': 30.088345527648926,
  'LightGBM_BAG_L1/T3': 32.556703329086304,
  'CatBoost_BAG_L1/T1': 89.40554642677307,
  'WeightedEnsemble_L2': 0.29285240173339844,
  'LightGBM_BAG_L2/T1': 31.565502166748047,
  'LightGBM_BAG_L2/T2': 32.419673681259155,
  'CatBoost_BAG_L2/T1': 84.67799425125122,
  'WeightedEnsemble_L3': 0.2337806224822998,
  'LightGBM_BAG_L3/T1': 31.511454582214355,
  'LightGBM_BAG_L3/T2': 30.89882493019104,
  'CatBoost_BAG_L3/T1': 81.54535126686096,
  'WeightedEnsemble_L4': 0.22852873802185059,
  'LightGBM_BAG_L4/T1': 30.682126760482788,
  'CatBoost_BAG_L4/T1': 61.62382435798645,
  'WeightedEnsemble_L5': 0.179884672164917},
 'model_pred_times': {'LightGBM_BAG_L1/T1': 0.0001537799835205078,
  'LightGBM_BAG_L1/T2': 9.5367431640625e-05,
  'LightGBM_BAG_L1/T3': 9.751319885253906e-05,
  'CatBoost_BAG_L1/T1': 0.00010991096496582031,
  'WeightedEnsemble_L2': 0.0007467269897460938,
  'LightGBM_BAG_L2/T1': 9.465217590332031e-05,
  'LightGBM_BAG_L2/T2': 0.00013136863708496094,
  'CatBoost_BAG_L2/T1': 8.58306884765625e-05,
  'WeightedEnsemble_L3': 0.0009334087371826172,
  'LightGBM_BAG_L3/T1': 0.00012540817260742188,
  'LightGBM_BAG_L3/T2': 0.00013065338134765625,
  'CatBoost_BAG_L3/T1': 0.00012445449829101562,
  'WeightedEnsemble_L4': 0.0009205341339111328,
  'LightGBM_BAG_L4/T1': 9.298324584960938e-05,
  'CatBoost_BAG_L4/T1': 0.00013208389282226562,
  'WeightedEnsemble_L5': 0.0008234977722167969},
 'num_bag_folds': 10,
 'max_stack_level': 5,
 'model_hyperparams': {'LightGBM_BAG_L1/T1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L1/T2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L1/T3': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'CatBoost_BAG_L1/T1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'WeightedEnsemble_L2': {'use_orig_features': False,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L2/T1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L2/T2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'CatBoost_BAG_L2/T1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'WeightedEnsemble_L3': {'use_orig_features': False,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L3/T1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L3/T2': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'CatBoost_BAG_L3/T1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'WeightedEnsemble_L4': {'use_orig_features': False,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'LightGBM_BAG_L4/T1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'CatBoost_BAG_L4/T1': {'use_orig_features': True,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True},
  'WeightedEnsemble_L5': {'use_orig_features': False,
   'max_base_models': 25,
   'max_base_models_per_type': 5,
   'save_bag_folds': True}},
 'leaderboard':                   model  score_val  pred_time_val    fit_time  \
 0   WeightedEnsemble_L3 -36.361050       0.001702  331.316357   
 1    LightGBM_BAG_L2/T1 -36.453077       0.000551  213.984909   
 2    LightGBM_BAG_L2/T2 -36.495067       0.000588  214.839080   
 3   WeightedEnsemble_L4 -36.663953       0.002069  475.266736   
 4    CatBoost_BAG_L3/T1 -36.686524       0.000893  412.627928   
 5   WeightedEnsemble_L2 -36.753990       0.000954  122.255102   
 6    LightGBM_BAG_L1/T3 -36.763544       0.000098   32.556703   
 7    LightGBM_BAG_L3/T2 -36.999679       0.000899  361.981402   
 8    LightGBM_BAG_L3/T1 -37.102334       0.000894  362.594031   
 9   WeightedEnsemble_L5 -37.107944       0.002198  567.524043   
 10   CatBoost_BAG_L2/T1 -37.121271       0.000542  267.097401   
 11   CatBoost_BAG_L4/T1 -37.139012       0.001281  536.662032   
 12   LightGBM_BAG_L4/T1 -37.441667       0.001242  505.720334   
 13   LightGBM_BAG_L1/T1 -39.666828       0.000154   30.368811   
 14   LightGBM_BAG_L1/T2 -41.578265       0.000095   30.088346   
 15   CatBoost_BAG_L1/T1 -43.293445       0.000110   89.405546   
 
     pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  \
 0                 0.000933           0.233781            3       True   
 1                 0.000095          31.565502            2       True   
 2                 0.000131          32.419674            2       True   
 3                 0.000921           0.228529            4       True   
 4                 0.000124          81.545351            3       True   
 5                 0.000747           0.292852            2       True   
 6                 0.000098          32.556703            1       True   
 7                 0.000131          30.898825            3       True   
 8                 0.000125          31.511455            3       True   
 9                 0.000823           0.179885            5       True   
 10                0.000086          84.677994            2       True   
 11                0.000132          61.623824            4       True   
 12                0.000093          30.682127            4       True   
 13                0.000154          30.368811            1       True   
 14                0.000095          30.088346            1       True   
 15                0.000110          89.405546            1       True   
 
     fit_order  
 0           9  
 1           6  
 2           7  
 3          13  
 4          12  
 5           5  
 6           3  
 7          11  
 8          10  
 9          16  
 10          8  
 11         15  
 12         14  
 13          1  
 14          2  
 15          4  }
In [64]:
predictor_hpo.leaderboard(silent=True).plot(kind="bar", x="model", y="score_val")
Out[64]:
<AxesSubplot:xlabel='model'>
In [65]:
# Save validation scores
leaderboard_hpo = predictor_hpo.leaderboard()
leaderboard_hpo["description"] = "hpo scores"
leaderboard_hpo.to_csv("docs/leaderboard_hpo.csv", index=False)
                  model  score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L3 -36.361050       0.001702  331.316357                0.000933           0.233781            3       True          9
1    LightGBM_BAG_L2/T1 -36.453077       0.000551  213.984909                0.000095          31.565502            2       True          6
2    LightGBM_BAG_L2/T2 -36.495067       0.000588  214.839080                0.000131          32.419674            2       True          7
3   WeightedEnsemble_L4 -36.663953       0.002069  475.266736                0.000921           0.228529            4       True         13
4    CatBoost_BAG_L3/T1 -36.686524       0.000893  412.627928                0.000124          81.545351            3       True         12
5   WeightedEnsemble_L2 -36.753990       0.000954  122.255102                0.000747           0.292852            2       True          5
6    LightGBM_BAG_L1/T3 -36.763544       0.000098   32.556703                0.000098          32.556703            1       True          3
7    LightGBM_BAG_L3/T2 -36.999679       0.000899  361.981402                0.000131          30.898825            3       True         11
8    LightGBM_BAG_L3/T1 -37.102334       0.000894  362.594031                0.000125          31.511455            3       True         10
9   WeightedEnsemble_L5 -37.107944       0.002198  567.524043                0.000823           0.179885            5       True         16
10   CatBoost_BAG_L2/T1 -37.121271       0.000542  267.097401                0.000086          84.677994            2       True          8
11   CatBoost_BAG_L4/T1 -37.139012       0.001281  536.662032                0.000132          61.623824            4       True         15
12   LightGBM_BAG_L4/T1 -37.441667       0.001242  505.720334                0.000093          30.682127            4       True         14
13   LightGBM_BAG_L1/T1 -39.666828       0.000154   30.368811                0.000154          30.368811            1       True          1
14   LightGBM_BAG_L1/T2 -41.578265       0.000095   30.088346                0.000095          30.088346            1       True          2
15   CatBoost_BAG_L1/T1 -43.293445       0.000110   89.405546                0.000110          89.405546            1       True          4
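Note that AutoGluon flips the metric's sign so that higher is always better (as the training log states), which is why score_val is negative in these leaderboards. To report positive RMSE values, simply negate the column; a sketch on a toy frame (the real leaderboard DataFrame works the same way):

```python
import pandas as pd

# Toy stand-in for the leaderboard DataFrame returned by predictor.leaderboard()
lb = pd.DataFrame({
    "model": ["WeightedEnsemble_L3", "LightGBM_BAG_L2/T1"],
    "score_val": [-36.361050, -36.453077],
})
lb["rmse"] = -lb["score_val"]  # undo the higher-is-better sign flip
print(lb[["model", "rmse"]].to_string(index=False))
```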
In [66]:
predictions_hpo = predictor_hpo.predict(test)
predictions_hpo.head()
Out[66]:
0    11.541133
1     6.640240
2     6.064966
3     5.733857
4     5.725714
Name: count, dtype: float32
In [67]:
predictions_hpo.describe()
Out[67]:
count    6493.000000
mean      191.760544
std       172.945343
min         5.417264
25%        46.607738
50%       151.285156
75%       283.315765
max       867.180420
Name: count, dtype: float64
In [68]:
# Remember to set all negative values to zero
predictions_hpo_rev = pd.DataFrame(predictions_hpo)
count_neg = len(predictions_hpo_rev[predictions_hpo_rev["count"] < 0])

if count_neg > 0:
    predictions_hpo_rev.loc[predictions_hpo_rev["count"] < 0, ["count"]] = 0
    print("{} negative predictions were set to zero".format(count_neg))
    print(predictions_hpo_rev[predictions_hpo_rev["count"] == 0])
else:
    print("{} negative values were found".format(count_neg))
0 negative values were found
In [69]:
submission_hpo = pd.read_csv("sampleSubmission.csv", parse_dates=["datetime"])
submission_hpo.head()
Out[69]:
datetime count
0 2011-01-20 00:00:00 0
1 2011-01-20 01:00:00 0
2 2011-01-20 02:00:00 0
3 2011-01-20 03:00:00 0
4 2011-01-20 04:00:00 0
In [70]:
# Save predictions for submission
submission_hpo["count"] = predictions_hpo_rev.round(0).astype(int)
submission_hpo.to_csv("submission_hpo.csv", index=False)
In [71]:
submission_hpo.describe()
Out[71]:
             count
count  6493.000000
mean    191.763130
std     172.935337
min       5.000000
25%      47.000000
50%     151.000000
75%     283.000000
max     867.000000
In [72]:
!kaggle competitions submit -c bike-sharing-demand -f submission_hpo.csv -m "model with new features and hpo"
100%|█████████████████████████████████████████| 149k/149k [00:00<00:00, 263kB/s]
Successfully submitted to Bike Sharing Demand
In [73]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6
fileName                     date                 description                                   status    publicScore  privateScore  
---------------------------  -------------------  --------------------------------------------  --------  -----------  ------------  
submission_hpo.csv           2023-01-04 03:46:50  model with new features and hpo               complete  0.47562      0.47562       
submission_new_features.csv  2023-01-04 02:38:24  model with new features                       complete  0.65341      0.65341       
submission.csv               2023-01-04 02:17:18  first raw submission                          complete  1.79200      1.79200       
submission_hpo.csv           2023-01-04 01:59:41  model with new features and hpo               complete  0.47675      0.47675       
tail: write error: Broken pipe

New Score of 0.47562¶

In [78]:
# Score (high-level hyperparameters only): 0.62542
# Score (high-level hyperparameters plus per-model hyperparameters and hyperparameter_tune_kwargs): 0.47562
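The overall gain can be quantified from the Kaggle public scores above (lower RMSLE is better); a quick sketch:

```python
# Kaggle public scores from the submissions table above
initial, hpo = 1.79200, 0.47562

# Relative improvement of the tuned model over the first raw submission
improvement = (initial - hpo) / initial * 100
print(f"{improvement:.1f}% lower score than the initial submission")  # 73.5%
```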

Step 7: Write a Report¶

Refer to the markdown file for the full report¶

Creating plots and table for report¶

In [83]:
# Take the top model score from each training run and create a line plot to show improvement
# You can create these in the notebook and save them to PNG, or use some other tool (e.g. Google Sheets, Excel)
fig = pd.DataFrame(
    {
        "model": ["initial", "add_features", "hpo"],
        "score": [53.073174, 30.554656, 36.361050]
    }
).plot(x="model", y="score", figsize=(8, 6)).get_figure()
fig.savefig('img/model_train_score.png')
In [84]:
# Take the 3 Kaggle scores and create a line plot to show improvement
fig = pd.DataFrame(
    {
        "test_eval": ["initial", "add_features", "hpo"],
        "score": [1.79200, 0.65341, 0.47562]
    }
).plot(x="test_eval", y="score", figsize=(8, 6)).get_figure()
fig.savefig('img/model_test_score.png')
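Note that `savefig` raises `FileNotFoundError` if the target directory does not exist, so it can be worth creating `img/` up front (a no-op if it is already there):

```python
import os

# Ensure the output directory for the PNGs exists before calling savefig
os.makedirs("img", exist_ok=True)
```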

Hyperparameter table¶

In [85]:
# The 3 hyperparameters we tuned, with the Kaggle score as the result
pd.DataFrame({
    "model": ["initial", "add_features", "hpo"],
    "num_stack_levels": [3, 3, 5],
    "num_bag_folds": [8, 8, 10],
    "num_bag_sets": [20, 20, 20],
    "score": [1.79200, 0.65341, 0.47562]
})
Out[85]:
model num_stack_levels num_bag_folds num_bag_sets score
0 initial 3 8 20 1.79200
1 add_features 3 8 20 0.65341
2 hpo 5 10 20 0.47562
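For reference, the tabled values for the `hpo` run correspond to high-level ensembling arguments of AutoGluon's `TabularPredictor.fit`. A hypothetical kwargs sketch (the dict is illustrative; the names match `fit`'s signature):

```python
# High-level ensembling hyperparameters for the "hpo" run, as they would be
# passed via predictor.fit(train_data, **fit_kwargs) in AutoGluon.
fit_kwargs = {
    "num_stack_levels": 5,   # depth of multi-layer stacking
    "num_bag_folds": 10,     # k-fold bagging folds per model
    "num_bag_sets": 20,      # repeats of the bagging process
}
print(fit_kwargs["num_stack_levels"])  # 5
```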

References¶

  • AutoGluon Predictors. AutoGluon. https://auto.gluon.ai/stable/api/autogluon.predictor.html#autogluon.tabular.TabularPredictor.fit
  • Predicting Columns in a Table - In Depth. AutoGluon. https://auto.gluon.ai/0.0.15/tutorials/tabular_prediction/tabular-indepth.html
  • Default (fixed) hyperparameter values used in Gradient Boosting model. AutoGluon. https://github.com/autogluon/autogluon/blob/master/tabular/src/autogluon/tabular/models/lgb/hyperparameters/parameters.py#L43-L51
  • How to use AutoGluon for Kaggle competitions. AutoGluon. https://auto.gluon.ai/stable/tutorials/tabular_prediction/tabular-kaggle.html
  • Time-related feature engineering. Scikit-learn. https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html
  • Feature Engineering Examples: Binning Numerical Features. Towards Data Science. https://towardsdatascience.com/feature-engineering-examples-binning-numerical-features-7627149093d